The original data used in this tutorial can be found on the GitHub page of Rforwards: https://github.com/forwards/teaching_examples/tree/master/AFLW.
In this tutorial you will learn how to recognise different types of variables and how to summarise and plot them in R.
You will learn these statistical concepts and techniques by exploring the AFL Women's dataset taken from the 2017 and 2018 seasons.
A variable is a set of observations. For example, imagine collecting the age of every student in your class. The list of ages can be recorded in a column of an Excel spreadsheet, and you would refer to that column as the variable Age. Each entry of the variable Age (one row, the age of one student) is referred to as an observation.
Categorical variables contain a finite number of categories or distinct groups. For example, the name of the football team, the gender of the player, the colour of the team. These variables are not intrinsically numeric.
Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric. For example, the number of customers visiting a pharmacy in a day, the number of players in a team, the number of siblings per student in your class.
Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. For example, the heights of trees in your school, the time when you wake up in the morning.
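As a rough illustration of the three types, the sketch below builds a small made-up data frame and asks R how each column is stored. The names `toy`, `team`, `siblings` and `height_m`, and all the values, are invented for this example and are not part of the AFLW data.

```r
# Hypothetical toy data: all values are invented for illustration only
toy <- data.frame(
  team     = c("Blue", "Red", "Blue", "Green"),  # categorical
  siblings = c(0L, 2L, 1L, 3L),                  # discrete (whole-number counts)
  height_m = c(1.62, 1.75, 1.58, 1.80),          # continuous (measurements)
  stringsAsFactors = FALSE
)

# class() reports how R stores each column
sapply(toy, class)

# A categorical variable is described by its distinct categories
unique(toy$team)

# Numeric variables support arithmetic such as the mean
mean(toy$siblings)
mean(toy$height_m)
```

Keep in mind that the storage class is only a hint. A column such as Year may be stored as an integer and still be treated as a categorical variable in a plot.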
Let’s read the AFLW spreadsheet into R and test your understanding of the different types of variables.
Note: Each function that you use in R belongs to a package, which you need to load before you can use that function.
library(readr) # Load the package 'readr' in order to read .csv files into R
players <- read_csv("data/players.csv")
colnames(players) <- gsub(" ","_",colnames(players))
colnames(players)[colnames(players) %in% "Time_On_Ground_%"] <- "Time_On_Ground_prop"
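To see what the two renaming lines above do without needing the data file, here is a minimal sketch on a short character vector. The vector `old_names` is a hypothetical stand-in for the spreadsheet's column names, which are assumed to contain spaces.

```r
# Hypothetical column names resembling those assumed to be in the spreadsheet
old_names <- c("Kicks TOT", "Frees For AVG", "Time On Ground %")

# gsub() replaces every space with an underscore
new_names <- gsub(" ", "_", old_names)

# Rename the one column whose name ends in "%"
new_names[new_names == "Time_On_Ground_%"] <- "Time_On_Ground_prop"

new_names
```

Names without spaces or symbols are easier to type after `$` because they do not need to be wrapped in backticks.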
Print the first 6 rows of the players dataset.
library(knitr) # The knitr package provides kable(), which prints a dataset on screen in a nicer way. Compare the two ways below.
head(players)
## # A tibble: 6 x 45
## Player Club Kicks_TOT Kicks_AVG Handballs_TOT Handballs_AVG
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Aasta O'Connor WB 9 2.3 14 3.5
## 2 Abbey Holmes ADEL 35 4.4 38 4.8
## 3 Aimee Schmidt GWS 21 3 17 2.4
## 4 Ainslie Kemp MELB 21 5.3 9 2.3
## 5 Akec Makur Chuot FRE 29 4.8 8 1.3
## 6 Alex Williams GWS 47 6.7 20 2.9
## # ... with 39 more variables: Disposals_TOT <int>, Disposals_AVG <dbl>,
## # Cont_Poss_TOT <int>, Cont_Poss_AVG <dbl>, Uncont_Poss_TOT <int>,
## # Uncont_Poss_AVG <dbl>, `Disp_eff_%` <dbl>, Clangers_TOT <int>,
## # Clangers_AVG <dbl>, Marks_TOT <int>, Marks_AVG <dbl>,
## # Cont_marks_TOT <int>, Cont_marks_AVG <dbl>, Marks50_TOT <int>,
## # Marks50_AVG <dbl>, `Hit-outs_TOT` <int>, `Hit-outs_AVG` <dbl>,
## # Clearances_TOT <int>, Clearances_AVG <dbl>, Frees_For_TOT <int>,
## # Frees_For_AVG <dbl>, Frees_Agst_TOT <int>, Frees_Agst_AVG <dbl>,
## # Tackles_TOT <int>, Tackles_AVG <dbl>, `One_%s_TOT` <int>,
## # `One_%s_AVG` <dbl>, Bounces_TOT <int>, Bounces_AVG <dbl>,
## # Goals_TOT <int>, Goals_AVG <dbl>, Behinds_TOT <int>,
## # Behinds_AVG <dbl>, Goal_assists_TOT <int>, Goal_assists_AVG <dbl>,
## # `Goal_acc_%` <dbl>, Matches <int>, Time_On_Ground_prop <dbl>,
## # Year <int>
kable(head(players))
| Player | Club | Kicks_TOT | Kicks_AVG | Handballs_TOT | Handballs_AVG | Disposals_TOT | Disposals_AVG | Cont_Poss_TOT | Cont_Poss_AVG | Uncont_Poss_TOT | Uncont_Poss_AVG | Disp_eff_% | Clangers_TOT | Clangers_AVG | Marks_TOT | Marks_AVG | Cont_marks_TOT | Cont_marks_AVG | Marks50_TOT | Marks50_AVG | Hit-outs_TOT | Hit-outs_AVG | Clearances_TOT | Clearances_AVG | Frees_For_TOT | Frees_For_AVG | Frees_Agst_TOT | Frees_Agst_AVG | Tackles_TOT | Tackles_AVG | One_%s_TOT | One_%s_AVG | Bounces_TOT | Bounces_AVG | Goals_TOT | Goals_AVG | Behinds_TOT | Behinds_AVG | Goal_assists_TOT | Goal_assists_AVG | Goal_acc_% | Matches | Time_On_Ground_prop | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aasta O’Connor | WB | 9 | 2.3 | 14 | 3.5 | 23 | 5.8 | 12 | 3.0 | 12 | 3.0 | 65.2 | 8 | 2.0 | 4 | 1.0 | 0 | 0.0 | 2 | 0.5 | 24 | 6 | 0 | 0.0 | 1 | 0.3 | 3 | 0.8 | 6 | 1.5 | 6 | 1.5 | 0 | 0.0 | 1 | 0.3 | 0 | 0.0 | 1 | 0.3 | 100 | 4 | 73.6 | 2017 |
| Abbey Holmes | ADEL | 35 | 4.4 | 38 | 4.8 | 73 | 9.1 | 51 | 6.4 | 27 | 3.4 | 52.1 | 17 | 2.1 | 9 | 1.1 | 4 | 0.5 | 2 | 0.3 | 0 | 0 | 5 | 0.6 | 8 | 1.0 | 2 | 0.3 | 16 | 2.0 | 5 | 0.6 | 0 | 0.0 | 2 | 0.3 | 2 | 0.3 | 2 | 0.3 | 40 | 8 | 64.5 | 2017 |
| Aimee Schmidt | GWS | 21 | 3.0 | 17 | 2.4 | 38 | 5.4 | 13 | 1.9 | 23 | 3.3 | 55.3 | 8 | 1.1 | 15 | 2.1 | 1 | 0.1 | 3 | 0.4 | 0 | 0 | 0 | 0.0 | 1 | 0.1 | 3 | 0.4 | 9 | 1.3 | 5 | 0.7 | 0 | 0.0 | 3 | 0.4 | 0 | 0.0 | 0 | 0.0 | 50 | 7 | 82.4 | 2017 |
| Ainslie Kemp | MELB | 21 | 5.3 | 9 | 2.3 | 30 | 7.5 | 18 | 4.5 | 12 | 3.0 | 50.0 | 6 | 1.5 | 8 | 2.0 | 5 | 1.3 | 3 | 0.8 | 0 | 0 | 3 | 0.8 | 1 | 0.3 | 2 | 0.5 | 8 | 2.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 0.5 | 1 | 0.3 | 0 | 4 | 63.7 | 2017 |
| Akec Makur Chuot | FRE | 29 | 4.8 | 8 | 1.3 | 37 | 6.2 | 20 | 3.3 | 16 | 2.7 | 48.6 | 8 | 1.3 | 2 | 0.3 | 1 | 0.2 | 0 | 0.0 | 6 | 1 | 5 | 0.8 | 0 | 0.0 | 2 | 0.3 | 13 | 2.2 | 11 | 1.8 | 1 | 0.2 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 6 | 64.9 | 2017 |
| Alex Williams | GWS | 47 | 6.7 | 20 | 2.9 | 67 | 9.6 | 35 | 5.0 | 23 | 3.3 | 59.7 | 10 | 1.4 | 6 | 0.9 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 4 | 0.6 | 8 | 1.1 | 1 | 0.1 | 21 | 3.0 | 14 | 2.0 | 1 | 0.1 | 0 | 0.0 | 1 | 0.1 | 1 | 0.1 | 0 | 7 | 88.6 | 2017 |
#View(players)
What type of variable is Club?
players$Club[1:10]
## [1] "WB" "ADEL" "GWS" "MELB" "FRE" "GWS" "BL" "GWS" "COLL" "FRE"
What type of variable is Kicks_TOT?
players$Kicks_TOT[1:10]
## [1] 9 35 21 21 29 47 29 7 63 4
What type of variable is Kicks_AVG?
players$Kicks_AVG[1:10]
## [1] 2.3 4.4 3.0 5.3 4.8 6.7 3.6 1.8 9.0 1.3
table(players$Club)
##
## ADEL BL CARL COLL FRE GWS MELB WB
## 57 58 59 58 62 60 58 57
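Counts are often easier to compare as proportions. The sketch below re-enters the counts printed above by hand (so it runs without the data file) and converts them to percentages with `prop.table()`. The name `club_counts` is introduced here for illustration only.

```r
# Counts copied from the table(players$Club) output above
club_counts <- c(ADEL = 57, BL = 58, CARL = 59, COLL = 58,
                 FRE = 62, GWS = 60, MELB = 58, WB = 57)

# Total number of rows across all clubs
sum(club_counts)

# Share of rows per club, as a percentage rounded to one decimal place
round(100 * prop.table(club_counts), 1)

# Club with the most rows
names(which.max(club_counts))
```

With the real data, `prop.table(table(players$Club))` should give the same proportions directly. Note that these are counts of rows across both seasons, so a player who appears in 2017 and 2018 is counted twice.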
A two-dimensional barplot usually shows the categories of the variable as labels on the x-axis and, on the y-axis, the number of times each category appears in the dataset.
Compare the following two plots:
geom_bar() is used to produce the barplot.
theme_bw() is purely aesthetic and simply adds a white background.
What does fill=Club do?
What does coord_flip() do?
library(ggplot2)
ggplot(data = players,aes(x=Club,fill=Club)) + geom_bar() + theme_bw()
ggplot(data = players,aes(x=Club,fill=Club)) + geom_bar() + theme_bw() + coord_flip()
Discrete and continuous variables are usually summarised and displayed using similar tools. Often, discrete variables can be seen as a special case of continuous variables.
table(players$Kicks_TOT)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 40 5 3 9 10 11 9 9 13 6 10 10 4 3 9 7 5 6
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 8 6 5 9 7 6 8 5 4 5 8 9 7 4 5 6 8 10
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 9 7 5 10 5 10 3 8 6 8 10 6 8 4 3 2 7 2
## 54 55 56 57 58 59 60 61 63 64 65 66 67 68 69 71 72 73
## 4 1 1 4 1 2 4 3 2 2 2 1 4 2 1 1 2 1
## 74 75 76 78 79 81 82 84 85 86 87 89 91 96 97 101 102 105
## 1 2 1 1 2 1 3 1 2 1 2 2 1 2 1 1 2 1
## 123 124
## 1 1
summary(players$Kicks_TOT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 28.00 30.23 44.00 124.00
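Each number reported by `summary()` can also be computed on its own. The following sketch uses a short invented vector (`kicks` here is hypothetical, not the real Kicks_TOT column) so that the results are easy to check by hand.

```r
# Hypothetical kick counts for nine players (invented values, already sorted)
kicks <- c(0, 4, 9, 15, 21, 29, 35, 47, 63)

min(kicks)      # smallest value
max(kicks)      # largest value
median(kicks)   # middle value of the sorted data
mean(kicks)     # arithmetic average
quantile(kicks, probs = c(0.25, 0.75))  # 1st and 3rd quartiles

# summary() reports the same six numbers in one call
summary(kicks)
```

In the real output above, the mean (30.23) is larger than the median (28), which often indicates that a few large values are pulling the mean upwards.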
ggplot(data = players,aes(x=Kicks_TOT)) + geom_histogram(colour="white") + theme_bw()
# Alternative for continuous variables: Boxplot
ggplot(data = players,aes(x="Tot kicks",y=Kicks_TOT)) + geom_boxplot() + theme_bw()
# alternative way of producing a boxplot
boxplot(players$Kicks_TOT)
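A boxplot is drawn from five numbers plus any points flagged as outliers. As a small sketch, base R's `boxplot.stats()` returns those numbers without drawing anything. The vector `kicks` below is invented for illustration and is not the real Kicks_TOT column.

```r
# Hypothetical kick counts (invented values), including one unusually large value
kicks <- c(0, 4, 9, 15, 21, 29, 35, 47, 63, 124)

stats <- boxplot.stats(kicks)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Values drawn as individual points beyond the whiskers
stats$out
```

By default a point is flagged when it lies more than 1.5 times the box height beyond the box, so the whisker stops at the most extreme value that is not flagged.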
For example, summarise and plot the total number of kicks per AFLW team.
library(dplyr)
kicks_by_team <- players %>% group_by(Year,Club) %>%
summarise(Tot.kicks = sum(Kicks_TOT))
kicks_by_team
## # A tibble: 16 x 3
## # Groups: Year [?]
## Year Club Tot.kicks
## <int> <chr> <int>
## 1 2017 ADEL 1052
## 2 2017 BL 977
## 3 2017 CARL 780
## 4 2017 COLL 838
## 5 2017 FRE 817
## 6 2017 GWS 758
## 7 2017 MELB 911
## 8 2017 WB 706
## 9 2018 ADEL 906
## 10 2018 BL 1077
## 11 2018 CARL 757
## 12 2018 COLL 959
## 13 2018 FRE 850
## 14 2018 GWS 848
## 15 2018 MELB 894
## 16 2018 WB 1050
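The group-then-summarise idea is not specific to dplyr. As a cross-check, the sketch below applies the same logic with base R's `aggregate()` to a tiny invented data frame. The object `toy` and its values are hypothetical, and this is only an alternative illustration, not the approach used in this tutorial.

```r
# Hypothetical player-level data (invented values)
toy <- data.frame(
  Year      = c(2017, 2017, 2017, 2018, 2018, 2018),
  Club      = c("ADEL", "ADEL", "WB", "ADEL", "WB", "WB"),
  Kicks_TOT = c(35, 12, 9, 20, 14, 30)
)

# Sum Kicks_TOT within every Year/Club combination
toy_totals <- aggregate(Kicks_TOT ~ Year + Club, data = toy, FUN = sum)
toy_totals
```

Each Year/Club pair ends up as one row, and the grand total of kicks is unchanged by the grouping.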
# Plot the summarised totals: with the player-level data, stat="identity" and position="dodge" would draw each player's bar on top of the others
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks)) + geom_bar(position="dodge",stat="identity") + theme_bw() + facet_wrap(~Year)
# Add title and colour by year
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks,fill=factor(Year))) + geom_bar(position="dodge",stat="identity") + theme_bw() + ggtitle("Total kicks by club by year (2017-2018)")
# Flip coordinates
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks,fill=factor(Year))) + geom_bar(position="dodge",stat="identity") + theme_bw() + ggtitle("Total kicks by club by year (2017-2018)") + coord_flip()
# Plot total number of goals instead of kicks; stat="summary" with fun="sum" adds up the players within each club and year (the fun argument needs ggplot2 3.3.0 or later)
ggplot(data = players,aes(x = Club, y = Goals_TOT,fill=factor(Year))) + geom_bar(position="dodge",stat="summary",fun="sum") + theme_bw() + ggtitle("Total goals by club by year (2017-2018)") + coord_flip()
# Kicks by goals
ggplot(data = players,aes(x = Kicks_TOT, y = Goals_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total goals (2017-2018)") + coord_flip()
ggplot(data = players,aes(x = Kicks_TOT, y = Goals_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total goals (2017-2018)") + coord_flip() + facet_wrap(~Year)
# Kicks by handballs
ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip()
ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip() + facet_wrap(~Year)
What can you say about these plots? Is there a relationship between the number of handballs per player and the number of kicks?
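One way to put a number on what a scatter plot shows is the correlation coefficient. The sketch below is only a hedged illustration: it uses the kicks and handballs of the six players printed by head(players) earlier, so the result says nothing reliable about the full dataset. The vector names `kicks` and `handballs` are introduced here for the example.

```r
# Kicks_TOT and Handballs_TOT for the six players shown by head(players)
kicks     <- c(9, 35, 21, 21, 29, 47)
handballs <- c(14, 38, 17, 9, 8, 20)

# Pearson correlation: values near 1 or -1 indicate a strong linear
# relationship, values near 0 indicate little linear relationship
r <- cor(kicks, handballs)
r
```

With the full data, `cor(players$Kicks_TOT, players$Handballs_TOT)` would apply the same calculation to every row.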
An example of an interactive plot
library(plotly)
# Show player and club in the tooltip through a single text aesthetic; mapping label twice triggers a duplicated-aesthetic warning
ggplotly(ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT,text=paste(Player,Club,sep=", "))) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip() + facet_wrap(~Year))
purl("explore_teams_and_players.Rmd")